Self-Directed Learning of Convex Labelings on Graphs

Sokolov, Georgy, Thiessen, Maximilian, Akhmejanova, Margarita, Vitale, Fabio, Orabona, Francesco

arXiv.org Machine Learning

We study the problem of learning the clusters of a given graph in the self-directed learning setup. This learning setting is a variant of online learning, where rather than an adversary determining the sequence in which nodes are presented, the learner autonomously and adaptively selects them. While self-directed learning of Euclidean halfspaces, linear functions, and general abstract multi-class hypothesis classes was recently considered, no results previously existed specifically for self-directed node classification on graphs. In this paper, we address this problem by developing efficient algorithms for it. More specifically, we focus on the case of (geodesically) convex clusters, i.e., for every two nodes sharing the same label, all nodes on every shortest path between them also share the same label. In particular, we devise a polynomial-time algorithm that makes only $3(h(G)+1)^4 \ln n$ mistakes on graphs with two convex clusters, where $n$ is the total number of nodes and $h(G)$ is the Hadwiger number, i.e., the size of the largest clique minor of the graph $G$. We also show that our algorithm is robust to the case where clusters are slightly non-convex, still achieving a mistake bound logarithmic in $n$. Finally, for the more standard case of homophilic clusters, where strongly connected nodes tend to belong to the same class, we devise a simple and efficient algorithm.
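The definition of a geodesically convex cluster can be checked directly on a small graph. The following is a minimal sketch, not the paper's learning algorithm: it assumes an undirected, unweighted graph stored as an adjacency dictionary, and the names `bfs_distances` and `is_geodesically_convex` are illustrative. It relies on the fact that a node w lies on some shortest path between u and v exactly when d(u, w) + d(w, v) = d(u, v).

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_geodesically_convex(adj, cluster):
    """True if every shortest path between two cluster nodes stays in the cluster."""
    cluster = set(cluster)
    dist = {u: bfs_distances(adj, u) for u in cluster}
    for u, v in combinations(cluster, 2):
        if v not in dist[u]:
            continue  # u and v are disconnected, so there is no path to check
        for w in adj:
            if w in cluster or w not in dist[u] or w not in dist[v]:
                continue
            # w outside the cluster sits on a shortest u-v path: not convex
            if dist[u][w] + dist[v][w] == dist[u][v]:
                return False
    return True


# Illustrative example: the 4-cycle 0-1-2-3-0.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(is_geodesically_convex(cycle, {0, 1}))  # adjacent nodes
print(is_geodesically_convex(cycle, {0, 2}))  # opposite nodes
```

On the 4-cycle, two adjacent nodes form a convex cluster, while two opposite nodes do not, because both remaining nodes lie on shortest paths between them.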


LLM4Decompile: Decompiling Binary Code with Large Language Models

Tan, Hanzhuo, Luo, Qi, Li, Jing, Zhang, Yuqun

arXiv.org Artificial Intelligence

Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100%. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results. Our code, dataset, and models are released at https://github.com/albertan017/LLM4Decompile


On the Trade-off between the Number of Nodes and the Number of Trees in a Random Forest

Akutsu, Tatsuya, Melkman, Avraham A., Takasu, Atsuhiro

arXiv.org Artificial Intelligence

In this paper, we focus on the prediction phase of a random forest and study the problem of representing a bag of decision trees using a smaller bag of decision trees. We consider only binary decision problems on the binary domain and simple decision trees in which each internal node queries the Boolean value of a single variable. As a main result, we show that the majority function of $n$ variables can be represented by a bag of $T$ ($< n$) decision trees, each of polynomial size, if $n-T$ is a constant, where $n$ and $T$ must be odd (to avoid ties). We also show that a bag of $n$ decision trees can be represented by a bag of $T$ decision trees, each of polynomial size, if $n-T$ is a constant and a small classification error is allowed. A related result on $k$-out-of-$n$ functions is also presented.
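To make the setting concrete, here is a minimal sketch of a bag of simple decision trees evaluated by majority vote. It covers only the trivial baseline with $T = n$, where each tree queries a single variable, and does not reproduce the paper's construction for $T < n$. The tuple encoding of trees and the function names are assumptions made for illustration.

```python
from itertools import product

# A tree is either a leaf (0 or 1) or a tuple
# (variable_index, subtree_if_0, subtree_if_1).


def evaluate_tree(tree, x):
    """Follow one root-to-leaf path, querying one Boolean variable per node."""
    while isinstance(tree, tuple):
        var, low, high = tree
        tree = high if x[var] else low
    return tree


def evaluate_bag(trees, x):
    """Majority vote over the trees (an odd number of trees avoids ties)."""
    votes = sum(evaluate_tree(t, x) for t in trees)
    return 1 if 2 * votes > len(trees) else 0


def majority(x):
    """Majority function of the input bits."""
    return 1 if 2 * sum(x) > len(x) else 0


n = 5
# Trivial bag: tree i returns the value of variable i.
trivial_bag = [(i, 0, 1) for i in range(n)]

# Exhaustively compare the bag with the majority function on all 2^n inputs.
agrees = all(
    evaluate_bag(trivial_bag, x) == majority(x)
    for x in product((0, 1), repeat=n)
)
print(agrees)
```

The question studied in the abstract is how much larger the individual trees must become when the bag is shrunk below $n$ trees.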


Teaching Large Language Models to Self-Debug

Chen, Xinyun, Lin, Maxwell, Schärli, Nathanael, Zhou, Denny

arXiv.org Artificial Intelligence

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, so some prior works have designed program repair approaches to improve code generation performance. In this work, we propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations. In particular, we demonstrate that Self-Debugging can teach the large language model to perform rubber duck debugging; i.e., without any human feedback on the code correctness or error messages, the model is able to identify its mistakes by investigating the execution results and explaining the generated code in natural language. Self-Debugging achieves state-of-the-art performance on several code generation benchmarks, including the Spider dataset for text-to-SQL generation, TransCoder for C++-to-Python translation, and MBPP for text-to-Python generation. On the Spider benchmark, where there are no unit tests to verify the correctness of predictions, Self-Debugging with code explanation consistently improves the baseline by 2-3%, and improves the prediction accuracy on problems of the hardest level by 9%. On TransCoder and MBPP, where unit tests are available, Self-Debugging improves the baseline accuracy by up to 12%. Meanwhile, by leveraging feedback messages and reusing failed predictions, Self-Debugging notably improves sample efficiency, and can match or outperform baseline models that generate more than 10x candidate programs.
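The execute-and-feed-back loop described for the unit-test setting can be sketched as follows. This is an illustration under strong assumptions, not the authors' implementation: the model is replaced by a hypothetical callable `propose(previous_code, feedback)`, the stand-in `toy_model` is hard-coded, and the code-explanation step is omitted. Running generated code with `exec` is unsafe outside a sandbox.

```python
def run_unit_tests(code, tests):
    """Run candidate code, then each test expression.

    Returns None on success, otherwise a short feedback message.
    """
    namespace = {}
    try:
        exec(code, namespace)  # unsafe for untrusted code; sketch only
        for test in tests:
            if not eval(test, namespace):
                return f"test failed: {test}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def self_debug(propose, tests, max_turns=3):
    """Ask for a program, execute it, and feed failures back to the proposer."""
    code, feedback = None, None
    for _ in range(max_turns):
        code = propose(code, feedback)
        feedback = run_unit_tests(code, tests)
        if feedback is None:
            return code  # all tests passed
    return None  # turn budget exhausted


def toy_model(previous_code, feedback):
    """Hypothetical stand-in for an LLM: buggy first, corrected after feedback."""
    if previous_code is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


fixed = self_debug(toy_model, ["add(2, 3) == 5", "add(0, 0) == 0"])
print(fixed)
```

Reusing the failed prediction and its feedback message on the next turn is what distinguishes this loop from simply sampling more independent candidates.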


Deep Learning -- Algorithm

#artificialintelligence

As noted above, a neuron is essentially a function that takes an input X and returns an output Y. The simplest such function is the aggregation function. A neuron is built in two parts: an aggregation function and an activation function. There is more than one possible activation function; a linear function is one option among others. At this point the algorithm can solve only linear problems, but we would like to solve more complex ones.
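The two-part structure of a neuron can be sketched in a few lines. The excerpt does not define either function, so this assumes the aggregation is a weighted sum plus a bias and uses the sigmoid and identity functions as two example activations; all names are illustrative.

```python
import math


def aggregate(inputs, weights, bias):
    """Aggregation step (assumed here to be a weighted sum plus a bias)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def sigmoid(z):
    """A common non-linear activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def linear(z):
    """Linear (identity) activation: the neuron stays a linear model."""
    return z


def neuron(inputs, weights, bias, activation=sigmoid):
    """A neuron: the aggregation function followed by an activation function."""
    return activation(aggregate(inputs, weights, bias))


print(neuron([1.0, 2.0], [0.5, -1.0], 0.5, activation=linear))
print(neuron([0.0, 0.0], [1.0, 1.0], 0.0))
```

With the linear activation the neuron can express only linear relationships between input and output, which is why non-linear activations are needed for more complex problems.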


Automatic Identification of Indicators of Compromise using Neural-Based Sequence Labelling

Zhou, Shengping, Long, Zi, Tan, Lianzhi, Guo, Hao

arXiv.org Artificial Intelligence

Indicators of Compromise (IOCs) are artifacts observed on a network or in an operating system that can be used to indicate a computer intrusion and detect cyber-attacks at an early stage. Thus, they play an important role in the field of cybersecurity. However, state-of-the-art IOC detection systems rely heavily on hand-crafted features that require expert knowledge of cybersecurity, and need a large amount of supervised training corpora to train an IOC classifier. In this paper, we propose using a neural-based sequence labelling model to identify IOCs automatically from cybersecurity reports without expert knowledge of cybersecurity. Our work is the first to apply end-to-end sequence labelling to the task of IOC identification. By using an attention mechanism and several token spelling features, we find that the proposed model is capable of identifying low-frequency IOCs from long sentences contained in cybersecurity reports. Experiments show that the proposed model outperforms other sequence labelling models, achieving over 88% average F1-score.


Neural Network

#artificialintelligence

This Emergent Mind project (#10!) implements a JavaScript-based neural network with back-propagation that can learn various logical operators. To begin the learning process, simply click the Start button above. By default the neural network will learn how to map an XOR operator, but you can change the operator it's trying to learn by changing the training set that it's using to teach the neural network. As the neural network learns how to map the operator, its predictions will become closer and closer to what the operator actually returns. For example, the XOR function should return 1 only when exactly one of its inputs is a 1: 00 should return 0, 01 should return 1, 10 should return 1, and 11 should return 0. At first the neural network's predictions will be completely random, but as each epoch passes and we train the neural network on what the output should be for that operator, its predictions will become closer and closer to the correct value.
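The training process described above can be sketched in Python rather than the project's JavaScript. This is an independent illustration, not the project's code: the network size (2 inputs, 3 hidden units, 1 output), the sigmoid activation, the squared-error loss, the learning rate, and the seed are all assumptions.

```python
import math
import random

# XOR training set: ((input_a, input_b), expected_output)
TRAINING_SET = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyNetwork:
    """2-input, one-hidden-layer, 1-output network trained by back-propagation."""

    def __init__(self, hidden=3, seed=0):
        rng = random.Random(seed)
        # Each hidden row holds two input weights plus one bias weight.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
        # Output weights: one per hidden unit plus one bias weight.
        self.w_out = [rng.uniform(-1, 1) for _ in range(hidden + 1)]

    def forward(self, x):
        inputs = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in self.w_hidden]
        output = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden + [1.0])))
        return hidden, output

    def predict(self, x):
        return self.forward(x)[1]

    def train_step(self, x, target, lr=0.5):
        inputs = list(x) + [1.0]
        hidden, output = self.forward(x)
        # Error signal at the output (squared error through the sigmoid).
        delta_out = (output - target) * output * (1.0 - output)
        # Propagate the error back using the output weights before updating them.
        delta_hidden = [
            delta_out * self.w_out[j] * hidden[j] * (1.0 - hidden[j])
            for j in range(len(hidden))
        ]
        for j, h in enumerate(hidden + [1.0]):
            self.w_out[j] -= lr * delta_out * h
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(inputs):
                row[i] -= lr * delta_hidden[j] * v


net = TinyNetwork()
for epoch in range(5000):
    for x, target in TRAINING_SET:
        net.train_step(x, target)

for x, target in TRAINING_SET:
    print(x, target, round(net.predict(x), 3))
```

How closely the printed predictions approach the targets depends on the random initial weights and the number of epochs; convergence is not guaranteed for every seed. Changing `TRAINING_SET` corresponds to changing the operator the network is asked to learn.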